
The design of fast algorithms for (\ref{gt}) is strongly dependent on three independent parameters viz., the number of sources $N$, the bandwidth $\delta$ and desired accuracy $\epsilon$. A Gaussian centered at a source location interacts with targets that are within its support. We use the {\em truncation algorithm} if the number of targets is less than a threshold value $n^*$ and we use the {\em expansion algorithm}, otherwise. The threshold value depends on all three independent parameters and we will discuss its choice after introducing both algorithms. 

\subsection{Truncation algorithm} 
We introduce a few definitions first.
\begin{mydef}[Support of the kernel] {\em The region around a target $x$ where the kernel centered at $x$ is greater than $\epsilon$ is called the support of the kernel and we denote it by $\omega$.}
\end{mydef}

Support of the Gaussian kernel, for instance, is the cube centered at $x$ with side length $2 \sqrt{\delta \ln (1/\epsilon)}$. We denote the volume enclosed by the $\omega$ as $|\omega|$.

\begin{mydef}[Interaction list] {\em For any target point $x$, all the sources that are within the support of the kernel centered at $x$ belong to the interaction list of $x$ denoted by $\mathcal{I}[x]$.}
\end{mydef}
In the truncation algorithm, we simply truncate the sum (\ref{gt}) to
%
\beq F(x_j) = \sum_{y_k \in \mathcal{I}[x_j]} \kernel f(y_k). \label{eqn:truncation} \eeq
%
There are $\bigO(N |\omega|)$ sources inside $\mathcal{I}[x_j]$ and hence, the complexity of this algorithm is $\bigO(N^2 |\omega|)$. When the bandwidth $\delta$ is small and/or the $\epsilon$ is large, the support $\omega$ shrinks and the complexity approaches asymptotically to $\bigO(N)$. This is a common technique used in the graphics community (Gaussian blurring). When $\omega$ is comparable to the size of the domain (unit cube) or when fraction of points within the support grows beyond a threshold $n^*$, clearly, the cost of this algorithm grows quadratically. The expansion algorithm leads to optimal complexity in those cases. We discuss this next.

\subsection{Expansion algorithm}
Broadly speaking, FGT algorithms \cite{fgt, greengard98} partition the domain into boxes of uniform size and proceed in three main steps: (i) compress the influence of sources into a few multipole-type moments, (ii) translate the moments across the domain and (iii) evaluate the moments at the targets. In the original FGT \cite{fgt}, the translation costs tend to dominate \cite{fggt}. The plane-wave expansion version \cite{greengard98} minimized the translation costs to a large extent but at the expense of higher computational cost for steps (i) and (iii). This problem is alleviated in \cite{fggt} to a large extent. Our implementation is based on \cite{fggt} and we summarize it here. 

Central to the expansion algorithm is the finite-term plane-wave expansion of the kernel,
\beq G_\delta(\norm{x_j - y_k}) \approx \sum_{|k| \leq p} \hat{G}(k) e^{i \lambda k \cdot (x_j - y_k)}, \quad \lambda = \frac{L}{p\sqrt{\delta}}\eeq
where $k = (k_1, k_2, k_3)$ and the parameters $p$ and $L$ are determined by the required precision. $\hat{G}$ is the discrete Fourier transform of the kernel. For the Gaussian kernel, it is given by
\beq \hat{G}(k) = \left(\frac{L }{2p\sqrt{\pi}}\right)^3 e^{-\frac{\lambda^2 |k|^2 \delta}{4}}. \label{eqn:ghat}\eeq

The algorithm begins by partitioning the domain into uniform boxes of size $\sqrt{\delta}$ each. The total number of boxes denoted by $|B|$ are then given by $|B| = (1/\sqrt{\delta})^3$. The kernel located at the center of a box $B$ decays below $\epsilon$ beyond a fixed number of boxes. We denote this fixed number in 1D by $K$ ($K^d$ in $d$ dimensions). 

\begin{mydef}[Interaction list] {\em The interaction list of a box $B$ denoted by $\mathcal{I}[B]$ is the set of all boxes that are covered by the support of the kernel located at the center of $B$. }
\end{mydef} 
%
Using these definitions, we are now ready to state the algorithm. A target $x \in D$ receives information from a source $y \in B$ in three steps:
\begin{description}
\item[\textbf{S2W}] The influence of all the sources in the box $B$ is condensed into a plane-wave expansion:
            \beq w_k = \sum_{y \in B} f(y) e^{i\lambda k \cdot (c^B - y)} \quad \forall\quad |k| \leq p.  \label{eqn:s2w} \eeq
            
\item[\textbf{W2L}] The plane-wave expansion of each box $B$ is transmitted to every box $D$ in $\mathcal{I}[B]$. By
 superposition, the ``local'' plane-wave expansion of $D$, denoted by $v_k$ for $|k| \leq p$, gets modified as
            \beq v_k += w_k e^{i\lambda k \cdot (c^D - c^B)}. \label{e:w2l}\eeq
            
\item[\textbf{L2T}] The transform (\ref{gt}) is computed at each target by evaluating the local expansion of the 
box it is contained in:
            \beq F(x) = \sum_{|k| \leq p} \hat{G}(k) v_k e^{i\lambda k \cdot (x - c^D)}. \label{eqn:l2t}\eeq
\end{description} 

First, the S2W step is executed separately in each box with $\bigO(p^3 N)$ work. In a {\em direct translation scheme}, 
the W2L is executed by simply visiting each box $B$ and {\em translating} its plane-wave expansion to all the boxes
 in $\mathcal{I}[B]$. Assuming the size of interaction list is $K^3$, this algorithm requires $\bigO(K^3 p^3 |B|)$ work 
 to form local expansions at all the boxes. Finally, the L2T is executed by visiting each box and evaluating the 
 local expansion at the targets, requiring $\bigO(p^3 N)$ work. Therefore, the overall cost of the algorithm is $\bigO(p^3 N + K^3 p^3 |B|)$.  

\subsection{Overall algorithm} 
The translation costs can be reduced dramatically by using the {\em sweeping algorithm} introduced in \cite{greengard98}. We present a modified version of the same in Section \ref{sc:sweep}. The total cost of the expansion algorithm is then typically dominated by the S2W and L2T steps which grow as $\bigO(p^3 N)$. However, when $\delta$ is lower and/or $\epsilon$ is higher, the cost of the expansion algorithm increases because condensing the sources into plane-waves becomes increasingly futile as
 the number sources per box gets reduced. On the other hand, the cost of the truncation algorithm decreases. Hence, we choose between the two algorithms based on the following heuristic:

{\tt
\begin{algorithmic}
\STATE
  \IF {$N |\omega| < (2p)^3 $}
     \STATE Use truncation algorithm 
  \ELSE 
     \STATE Use expansion algorithm
  \ENDIF
\STATE
\end{algorithmic}
}

The complexity of the overall algorithm, therefore, is $\bigO(p^3 N)$ independent of $\delta$ (or $|\omega|$).  

{\em Kernel independence.} All the steps we have described so far are applicable for any Gaussian-type kernel. 
In fact, except the L2T step which requires the Fourier transform of the kernel, the formulas in the remaining steps are the same for any kernel. Given the kernel values at the discrete points $\{Lk/p \}_{|k| \leq p}$, we can compute $\hat{G}$ using the discrete Fourier transform. The parameters $L, p$ and $K$, however, need to be tuned for desired accuracy based on the kernel \cite{fggt}. 

\subsection{Parallel truncation algorithm}
\label{sc:parallelTruncation}
We divide the unit cube into a regular grid of cuboidal blocks and distribute the blocks across $n_p$ processors
 such that \begin{inparaenum}[\itshape a\upshape)]
\item each processor owns exactly one block, and, \item each block is owned by an unique processor\end{inparaenum}.
 For each target owned by a particular processor, we find the processors that contain blocks which intersect the interaction list of that target and send the source information to those processors. Each processor then evaluates (\ref{eqn:truncation}) at all its targets.
 
\subsection{Parallel expansion algorithm} 
\label{sc:parallelExpansion}
We create a regular grid of $|B|$ FGT boxes distributed on $n_p$ processors 
such that {\tt{(a)}} each box is owned by an unique processor and {\tt{(b)}} each processor owns a 
sub-grid of FGT boxes. For ease of implementation, we assume that $|B| > n_p$. 
 We use PETSc's \cite{petsc-home-page} DA module to manage this distributed regular grid.
 The S2W and L2T steps are embarrassingly parallel; in the former step, each processor independently forms the 
 plane-wave expansions for each box that it owns and in the latter step, each processor
 independently computes the transform (\ref{gt}) for each target contained within some box owned by
  that processor using the local expansion for that box. Hence, the cost for these two steps is 
  simply $\bigO(p^3 \frac{N}{n_p})$. For the W2L step, each processor first gathers the plane-wave expansions of
  the boxes that are in the interaction list of some box owned by that processor. The communication cost for
  this step is $\bigO(K (\frac{|B|}{n_p})^{\frac{2}{3}} + K^2(\frac{|B|}{n_p})^{\frac{1}{3}} + K^3 )$.
Following this communication, each processor executes the sequential W2L algorithm for all boxes that it owns. 
 If we use the direct scheme for the sequential W2L algorithm then this cost would be $\bigO(p^3 K^3 \frac{|B|}{n_p})$.
 As mentioned earlier, this cost can be reduced by using the sweeping algorithm, which is described in the following section. 


